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We study maximum likelihood estimation in log-linear models un- 
der conditional Poisson sampling schemes. We derive necessary and 
sufficient conditions for existence of the maximum likelihood estima- 
tor (MLE) of the model parameters and investigate estimability of the 
natural and mean-value parameters under a nonexistent MLE. Our 
conditions focus on the role of sampling zeros in the observed table. 
We situate our results within the framework of extended exponential 
families, and we exploit the geometric properties of log-linear models. 
We propose algorithms for extended maximum likelihood estimation 
that improve and correct the existing algorithms for log-linear model 
analysis. 

1. Introduction. Log-linear models are arguably the most popular and 
important statistical models for the analysis of categorical data; see, for 
example, Bishop, Fienberg and Holland (1975), Christensen (1997). These 
powerful models, which include as special cases graphical models [see, e.g., 
Lauritzen (1996)] as well as many logit models [see, e.g., Agresti (2002), 
Bishop, Fienberg and Holland (1975)], have applications in many scientific 
areas, ranging from social and biological sciences, to privacy and disclosure 
limitation problems, medicine, data-mining, language processing and genet- 
ics. Their popularity has greatly increased in the last decades due to growing 
demands for analyzing databases taking the form of large and sparse con- 
tingency tables, where most of the cell entries are very small or zero counts. 
Despite the widespread usage of these models, the applicability and statisti- 
cal properties of log-linear models under sparse settings are still very poorly 
understood. As a result, even though high-dimensional sparse contingency 
tables constitute a type of data that is common in practice (e.g., in sam- 
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pie survey applications), their analysis remains exceptionally difficult; see 
Erosheva, Fienberg and Joutard (2007) for such an example. 

In this article we are concerned with statistical inference in log-linear 
models of arbitrary dimension, and, in particular, with conditions for the 
existence of the maximum likelihood estimator, or MLE, of the model pa- 
rameters. In log-linear model analysis, virtually all methodologies for as- 
sessment of fit, model selection and interpretation are applicable and have 
theoretical validity only provided that the MLE exists. Though this may ap- 
pear to be only a computational issue, in fact, when MLE is not defined, the 
applicability of statistical procedures routinely used by practitioners may no 
longer have a theoretical justification and, at the very least, require alter- 
ation. The statistical implications of a nonexistent MLE, some of which are 
detailed below, are numerous and severe. 

• Existence of the MLE is required to justify the use of large sample % 2 ap- 
proximations to numerous measures of goodness-of-fit commonly utilized 
for model assessment and model selection; see, for example, Bishop, Fien- 
berg and Holland (1975), Agresti (2002), Read and Cressie (1988). When 
the MLE does not exist, the standard regularity conditions used to derive 
such approximations no longer hold. As we show below, under a nonexis- 
tent MLE, the model is not identifiable, the asymptotic standard errors 
are not well defined and the number of degrees of freedom becomes mean- 
ingless. Though existence of the MLE is by no means enough to warrant 
the use of x 2 approximations, nonexistence will surely make them inade- 
quate. 

• Existence of the MLE is also needed to derive a limiting distribution for 
the double-asymptotic approximations of the likelihood ratio and Pear- 
son's x 2 statistic for tables in which both the sample size and the number 
of cells are allowed to grow unbounded, a setting studied, among others, 
by Morris (1975), Haberman (1977) and Koehler (1986); see also Read 
and Cressie (1988). 

• The issue of nonexistence is also important for Bayesian analysis of log- 
linear models; see, for example, King and Brooks (2001), Massam (2009), 
Dobra and Massam (2010) and references therein. Indeed, we will demon- 
strate that nonexistence of the MLE is due to the data not being fully 
informative about the model parameters, and results in nonestimability 
of those parameters. Since the nonexistence of MLEs is due to insufficient 
data, it cannot be remediated. In particular, the use of Bayesian methods 
in cases in which the MLE is nonexistent is equivalent to replacing the 
information content lacking in the data with the information contained in 
the prior. Since for some parameters no learning from the data takes place, 
the posterior distribution must be interpreted accordingly. Furthermore, 
when one uses improper priors for the log- linear parameters, the posterior 
may be also be improper when the MLE does not exist; see Forster (2004). 
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It has long been known [see, in particular, Birch (1963), Haberman (1974), 
Bishop, Fienberg and Holland (1975)] that the nonexistence of the MLE is 
caused by sampling zeros. When certain patterns of zero counts occur in 
the observed table, the log-likelihood function cannot be maximized by any 
vector of finite norm. While for hierarchical log-linear models, patterns of 
sampling zeros leading to null margins are well known to cause nonexistence 
of the MLE, very little has been known or observed about general patterns 
of sampling zeros associated with nonexistent MLEs. The very few know 
examples described in Haberman (1974), Fienberg and Rinaldo (2007) and 
Dobra et al. (2009) suggest that nonexistence of the MLE may occur in 
small tables, but is very likely to arise when the table is large and sparse. 

Haberman (1974) first obtained necessary and sufficient conditions for 
the existence of the MLE for log-linear models. Eriksson et al. (2006) gave 
a direct geometric interpretation of Haberman's conditons and proposed 
a polynomial time algorithm for checking for the existence of the MLE. 
Aickin (1979) and Verbeek (1992) refined Haberman's conditions by recast- 
ing the problem within the frameworks of exponential families and of gen- 
eralized linear models, respectively. In fact, the issue of nonexistence of the 
MLE is best dealt with using the general theory of exponential families 
and, in particular, of extended exponential families, originally put forward 
by Barndorff-Nielsen (1978) and then Brown (1986). See also the impor- 
tant work by Cencov (1982). In a recent series of papers, Csiszar and Matus 
(2001, 2003, 2005, 2008) broadened significantly the notions of extended ex- 
ponential families and extended maximum likelihood estimation to include 
very general settings under minimal assumptions. See, in particular, Re- 
mark 5.9 in Csiszar and Matus (2008), which briefly point to the connections 
with the theory of log-linear models. Rinaldo, Fienberg and Zhou (2009) and 
Geyer (2009) contain more specialized results directly relevant to the log- 
linear settings. Adopting a different approach, Lauritzen (1996) defined the 
parameter space for log-linear models as the point-wise limit closure of the 
log-linear model parameters, which he calls the extended log-linear model, 
and effectively treats the MLE and extended MLE as one entity. While this 
is theoretically convenient, the issue of nonestimability of the model param- 
eters is not resolved, and the computation of the extended MLE is just as 
problematic. Finally, Nardi and Rinaldo (2012) provided asymptotic condi- 
tions under which, for a hierarchical log-linear model, a penalized maximum 
likelihood estimator based on the group-lasso penalty will return the correct 
model, with high probability. 

Despite the breadth of the cited literature, two key issues concerning max- 
imum likelihood estimation in log-linear models remain. First, the properties 
of extended exponential families have not yet been specialized to the case 
of log-linear models. In particular, direct application of this theory does not 
yield, in general, usable conditions for the existence of the MLE, and the 
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identification of the nonestimable log-linear parameters or of the patterns 
of zeros leading to a nonexistent MLE are still open problems. Secondly, 
existing theoretical results have not been incorporated yet in any numerical 
algorithm for checking for existence of the MLE and for identifying nones- 
timable parameters. Consequently, virtually all statistical software currently 
available to practitioners is flawed, to the point that nonexistence of the MLE 
can be detected only by monitoring whether the algorithm used to optimize 
the log-likelihood function fails to converge, or converges slowly or becomes 
unstable; see, for example, Fienberg and Rinaldo (2007). Consequently, re- 
sults and decisions stemming from the statistical analysis of contingency 
tables containing substantial numbers of zero counts can be seriously com- 
promised. 

In this article we attempt to rectify these problems. Our contributions 
are two-fold: 

• From a theoretical standpoint, we derive necessary and sufficient condi- 
tions for existence of the MLE that are broadly applicable to a variety of 
sampling schemes and amenable to computations. Ultimately, these con- 
ditions amount to checking whether the observed sufficient statistics lie 
on the boundary a polyhedral cone, called the marginal cone; see Eriksson 
et al. (2006). When the MLE does not exist, we specialize the theory of 
extended exponential families to characterize the estimability of the nat- 
ural and mean-value parameters of the log-linear models. To this end, we 
focus on discrete exponential families with polyhedral convex support [see 
Rinaldo, Fienberg and Zhou (2009), Geyer (2009)], and rely significantly 
on tools from polyhedral geometry. 

• From a practical viewpoint, we develop algorithms for extended maximum 
likelihood estimation that are applicable to large tables. Our procedures 
will allow one to (i) detect nonexistence of the MLE and (ii) identify and 
estimate all the parameters that are in fact estimable. Overall, our al- 
gorithms correct and improve over many existing software for log-linear 
model analysis. Due to space constraints, a detailed description of these 
algorithms is contained in the supplementary material [Fienberg and Ri- 
naldo (2012)]. 

Notation. We let I be a finite set of indices or cells, representing the 
support of a discrete distribution, such as the joint distribution of a set of 
categorical variables. We set I =\I\, where \B\ is the cardinality the set B. 
We denote by be the vector space of real- valued functions on X, and R> 
and N 2 " its subset of nonnegative functions and nonnegative integer-valued 
functions, respectively. For vectors x and y, (x,y) = x T y represents their 
inner product and ||x|| = V x T x the corresponding Euclidean norm. If x € 
M 2 ", we denote by x(i) the value corresponding to the ith coordinate of x and 
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by supp(x) = {i :x(i) 7^ 0} the set of coordinates of x with nonzero values. 
We take functions and relations on vectors component- wise, for example, for 
x G R x , exp(x) = {e x W : i G X}. 

For a nonempty subset JCI, we let 7rjr : — > R-^" the coordinate pro- 
jection map given by {x(i) : i £ X} 1— >■ {x(i) : i G J 7 } and, for any S C M. x , we 
set irjr(S) = {7rjr(x),x G 5}. If M. is a linear subspace, we denote by M. 
its orthogonal complement and by Hm the orthogonal projector into Ai. 
If J\f is another linear subspace contained in A4, we write MQj\f for the 
subspace M n . 

For a matrix A, 7£ (A) denotes its column range and kernel (A) its null 
space. If the rows of A are indexed by X, and T is a nonempty subset of X, 
Ajr is the submatrix of A comprised of the rows with indexes in T . We write 
cone(A) for the polyhedral cone spanned by columns of A and conv(A) for 
the polytope consisting of the convex combinations of its columns. Similarly, 
for a set S, conv(S') is the convex hull of all its points. For a polyhedron P, 
we write its relative interior as ri(P). 

2. Log-linear models, sampling schemes and exponential families. Log- 
linear model analysis is concerned with the study of discrete probability 
distributions over a finite set X, whose elements will be referred to as cells. 
These distributions are assumed to form an exponential family of probabil- 
ities {Priori G R rf } with densities with respect to the counting measure on X 
of the form 

(1) p v (i) = P ri ({i}) = exp{( V , ai )-^(r 1 )}, r] G R rf , 

where each a^ is a nonzero vector in R d , and 4>{r]) = log(^ exp{(?7, a^)}) 
is the log-partition function. The I x d matrix A, whose ith row is the 
vector aj , is called the design matrix. 2 

Suppose we observe a sample of N independent and identically distributed 
realizations from an unknown distribution satisfying (1), where the data take 
the form of an unordered sequence of random cells (Li, . . . , Xjy), with Lj G X 
for each j, and where N too can be random. The observed cells are then 
cross-classified into a random integer vector n G N 1 , called a a contingency 
table, with n(i) = \ {j : Lj = i}\, for all i G X. 

Traditionally, log-linear model analysis is not directly concerned with 
the natural parameters rj in (1), but rather with the unknown expected 
value m := E[n] of the resulting contingency table, under the provision that 
m(i) > for each i. In detail, letting A4 C R 1 be the linear subspace spanned 
by the rows of the design matrix A, the ensuing log-linear model is predicated 



2 It is easy to see that design matrices are not uniquely determined: if Ai and A2 are 
two matrices of dimensions I x di and I X d,2, respectively, and with identical row spans, 
then they parametrize the same statistical model. 
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on the condition that n := log(m) € M. In particular, log-linear models are 
typically defined as statistical models for the distribution of the random 
table n indexed by the points in the linear subspace A4. 

The distribution of the table n depends on the sampling scheme used 
during the data collection process. In this article, we study sampling schemes 
based on linear restrictions on n, known as conditional Poisson sampling 
schemes, introduced in Haberman (1974), Chapter 1. Specifically, let J\f 
be a given m-dimensional linear subspace of M, which we will refer to as 
the sampling subspace, and c a known vector in R . The corresponding 
conditional sampling Poisson scheme prescribes that the distribution of n 
is given by the conditional distribution of / independent Poisson random 
variable {n(i),i £ 1} with mean parameters {m(i) = exp(/z(z)), i £ X}, where 
H £ M, given that ILvn = c. This type of data sampling includes the most 
commonly used sampling schemes, described below. 

• Poisson sampling scheme. The sampling subspace is J\f = {0}. Thus, there 
are no restrictions on n, which is a random vector comprised of indepen- 
dent Poisson random variables with mean m. The log- likelihood function 
is given by 

(2) ^(/x) = (n, / x)-l T exp(/x)-^logn(i)!, fxeM. 

i 

• Product multinomial and multinomial sampling schemes. Let B\, . . . ,B m 
be a partition of I. Under the product multinomial sampling, the condi- 
tional distribution of the cell counts n is the product of m independent 
multinomials of sizes Nj, j = 1, . . . ,m, each supported on the correspond- 
ing class Bj. Formally, let Xj be the indicator function of Bj, where Xj{i) 
is 1 if i £ Bj and otherwise, and define M to be the r-dimensional 
subspace spanned by the orthogonal vectors {Xij ■ ■ ■ iXr)- The product 
multinomial sampling constraints are of the form (n, Xj) = Nj, for known 
integer constants Nj. The log-likelihood function is [see Haberman (1974), 
equation 1.51] 

(3) 

where m = exp(/x). Because of the sampling constraints, i M is well defined 
only on the subset of M. , 

(4) M :={ f xeM:(Xj,exp{ f j,)) = N j ,j = l,...,r}, 

which is is neither a vector space nor a convex set. We give a more con- 
venient parametrization below in Lemma 2. The multinomial scheme is 
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a special case of product multinomial schemes, corresponding to the triv- 
ial one-class partition of X with indicator function 1. In this case, n has 
a multinomial distribution with size N = (l,n) = (l,m) and cell proba- 
bilities m/N. 

• Poisson-multinomial sampling schemes. This sampling scheme is a combi- 
nation of the previous two schemes. For a given partition B\, . . . , B m of I, 
the sampling constraints are of the form (n, Xj) = Nj for j = 1, . . . , m — 1, 
with the counts for the cells in the set B m left unconstrained; see Lang 
(2004, 2005). 

As is customary, we assume throughout that the sampling subspace N 
is strictly contained in Ai . The case J\f = Ai is practically uninteresting, 
as the resulting sampling constraints would fix the value of the sufficient 
statistics so that the conditional distribution of n will not depend on the 
model parameters. We treat the case Af <£_ Ai in the supplementary material 
[Fienberg and Rinaldo (2012)]. 

We now derive the equivalent exponential family representation for log- 
linear models under conditional Poisson schemes. To this end, we will express 
the sampling constraints in a different, but equivalent, form. Let (vi, . . . , v m ) 
be any set of m vectors spanning Af and such that (vj, c) = 1 for all j. Then, 
the sampling constraints take the form 

V T n = l, 

where V is the I xm matrix whose j th column is v,- . Accordingly, we denote 
with 

5(V) :={xGN X :V T x = l} 

the set of all possible tables compatible with the sampling constraints spec- 
ified by V. Let v be the finite measure on given by 3 

"W-n^y,. 

For a conditional Poisson scheme defined by V, let v\j be the restriction of v 
on S(V), that is, ^v( x ) := 1xgS(V)^( x )) with x £ N x . 

It is easy to see that the conditional distribution of the table n, given the 
sampling constraints determined by V, is the exponential family of distri- 
butions with base measure vy, sufficient statistics A T x, natural parameter 
space W 1 and densities given by 

(5) p e (x)=exp{(A T x,6»)- ^(0)}, x £ S(V), £ R d , 



This particular choice of the dominating measure will lead to Poisson and product 
multinomial likelihoods. More generally, much of our analysis carries over with other 
choices of dominating measure, for example, the ones for which conditions (A1)-(A4) 
in Rinaldo, Fienberg and Zhou (2009) hold. 
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where Y>(0) = log(f s ^ exp{(A T x, 6)} <ii/v(x)). This exponential representa- 
tion is not the most parsimonious from the viewpoint of sufficiency. Indeed, 
let T ={t£R d :t = A T x, x E 5(V)} be the image of A and // v = ^vA" 1 be 
the measure induced by A. Then, by standard arguments [see, e.g., Brown 
(1986)], the distributions of the sufficient statistics t = A T n also form an 
exponential family, with density with respect to the base measure fiy given 
by 

(6) g (t) = exp((t,0)-^(0)), t eT,OeR d , 

the same the log-partition function ijj and natural parameter space as in the 
original family. 

It is now easy to see that the exponential family parametrization and the 
log- linear parametrization are equivalent. Indeed, for any n and t such that 
t = A T n, and for any £ R rf , the identity 

(t,0) = (A T n,0) = (n,A0) = (n, M ), 

where /i = AO E M , implies these models can be equivalently parametrized 
by the linear subspace A 7 !. If A is of full rank, then the map \— > A6 
is an isomorphism between M. d and A4, while if d > dim(A / J), the natural 
parametrization is redundant and, in fact, nonidentifiable. 

Throughout this article, will impose the following assumptions. Let V be 
the matrix specifying the conditional Poisson sampling scheme. 

(AO) Nontriviality: the set S(V) is nonempty. 

(Al) Exhaustive sampling condition: there does not exist any vector 7 G 
J\[ \ {0}, such that (7,n) is constant almost everywhere with respect 
to ia/. In particular, for no cell i 6l, n(i) = 0, almost everywhere iaj . 

(A2) Integrality assumption: {x £ R> : Vx = 1} = conv(S'(V)). 

Assumption (Al) guarantees that no linear constraints hold, other than 
the ones specified by M, and it prevents the sampling constraints from in- 
troducing structural zeros. Even though we can easily extend our analysis 
to deal with structural zeros, we do not provide the details here. Assump- 
tion (A2) is technical, and it is used in Theorem 3 below to unify the con- 
ditions for existence of the MLE across different sampling schemes. If (A2) 
is not in effect, checking for existence of the MLE can become computa- 
tionally infeasible, depending on V. The Poisson, product multinomial and 
Poisson-multinomial schemes automatically satisfy (A2). 

2.1. The effects of sampling constraints. We conclude this section by 
studying the effect of the sampling constraints on the estimability of the 
natural and log-linear parameters. We show that imposing linear sampling 
restrictions results in nonidentifiability of the corresponding natural expo- 
nential family (5), to the extent that only certain linear combinations of the 
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natural parameters, which depend only on the subspace jV, are estimable. 
For the log-linear parameters, only Hj^Qj^fi is estimable, which implies that 
the number of estimable parameters is dim(.M Qj\f) = d — m. 

We define the following equivalence relation on ~R d : for 61,62 € 6\ ~ 62 
if and only if 6\ — 62 £ 2, where 

(7) Z:={CeR d :ACeM}. 

For any 6 G M d , we then write 6_\f := {6* : 6 ~ 6*} for the equivalence class 
containing 6, and G_v := {6j\f,6 S M. d } for the set of equivalent classes cor- 
responding to the equivalence relation ~. For simplicity, below we assume 
that the matrices A and V are of full rank, but the same conclusions hold 
with d replaced by rank(A). 

Lemma 1. Consider the exponential family (5), with A of full rank d, 
and suppose that conditions (AO) and (Al) hold. 

(i) The set Q_\f is a vector space of dimension d — m isomorphic to Ai Qj\f, 
and is comprised of parallel m- dimensional affine subspaces ofR d . 

(ii) The family is nonidentifiable: any two points 6\ ~ 62 specify the same 
distribution. In fact, this family is parametrized byQj\f, or, equivalently, 
by AiQN . Therefore, it is of order d—m. 

Using standard minimality arguments, nonidentifiability of the natural 
parameters can be easily resolved by redefining a smaller exponential family 
of order d — m using as a new design matrix any full-rank matrix whose col- 
umn span is Al Af; for this fully-identifiable family, the natural parameter 
space is R d ~ m . Concretely, we assume, without loss of generality, that the 
matrix A is of the form 

(8) A = (BV), 

where V is the I x m matrix of sampling restrictions whose rows span Af 
and B is a / x [d — m) matrix whose row space is AiQN . Then, replacing A 
with B in (5) will produce a full and minimal exponential family. 

To illustrate this point, we show that the log-likelihood function (3) 
for the product multinomial sampling scheme can be more conveniently 
parametrized by Ai N instead of the nonconvex set Ai . For any (3 € 
MQM, let 

m 

(9) £ M (/3):=(n,/3)-^iV j log(e X p(/3), Xj )-^log n (i)!. 

i=l i£T 

Lemma 2. The sets AiQAf are Ai homeomorphic and, for each pair of 
homeomorphic vectors n G M. and (5 G M. QM, lc{p) = £ AI ((3). 
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The form of the likelihood in (9) is better suited for computations, as we 
show in Fienberg and Rinaldo (2012). 

Under the conditions of Lemma 1, the Fisher information matrix at 6 has 
rank d — m, for each 6 £ W 1 . To see this, notice that the the Fisher infor- 
mation matrix 1(6) at 6 is Cove(A T n), where Covg denotes the covariance 
operator evaluated using the distribution parametrized by 6. Then, for any £ 
in the set Z defined in (7), the linear form (A T n, £) is constant almost ev- 
erywhere and therefore has zero variance. This is equivalent to C T I( d K = 0, 
so that rank(/(#)) = dim(2 _L ) = d — m, for all 6. 

3. Theory of maximum likelihood estimation. We now provide a sys- 
tematic treatment of maximum likelihood estimation for the natural and 
log- linear parameters, within the framework of the theory of discrete ex- 
tended exponential families with linear sufficient statistics. We refer the 
reader to Barndorff-Nielsen (1978) and Brown (1986) for classic references 
and Csiszar and Matus (2001, 2003, 2005, 2008) for advanced treatments. In 
our setting, Geyer (2009) and Rinaldo, Fienberg and Zhou (2009) are partic- 
ularly relevant. For the reader's convenience, we briefly review the aspects 
of this theory that are relevant to our problem in Appendix A. 

3.1. Existence of the MLE. We prove a general necessary and sufficient 
condition for existence of the MLE that applies to any conditional Poisson 
sampling scheme satisfying assumptions (A0)-(A2). Unlike existing results, 
these conditions directly translate into usable algorithms for checking for 
the existence of the MLE, as described in Fienberg and Rinaldo (2012). 

For any design matrix A, we denote by Ca := cone(A T ) the polyhedral 
cone spanned by the rows of A. Following Eriksson et al. (2006), we call Ca 
the marginal cone of A. 

Theorem 3. Assume conditions (A0)-(A2) and let A be any matrix 
with column span Ai. The MLE of 6j\f (or, equivalently, ofUj^Qj^fi) exists 
and is unique if and only if t = A T n G ri(CA)- 

This result is a nontrivial application of a well-known result about ex- 
istence of MLE in exponential families (viz., Theorem 13 in Appendix A), 
and it subsumes previous results of Haberman (1974) and Eriksson et al. 
(2006), because it provides a unified condition that applies to all condi- 
tional Poisson sampling schemes satisfying the integrality assumption (A2). 
To see how Theorem 3 differs from Theorem 13, a direct application of the 
latter yields that the MLE exists if and only if t belongs to the interior of 
the (d — m)-dimensional polyhedron 

C v := conv({t : t = A T x, x G N x , V T x = 1}). 

For Poisson sampling, this polyhedron is in fact the marginal cone, and, for 
multinomial sampling, it is the polytope {Vx:x £ conv(A)}. Under prod- 
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uct multinomial sampling, Cy is the Minkowsoki addition [see, e.g., Ziegler 
(1995), Schrijver (1998)] of m polytopes, one for each multinomial, while 
under Poisson-multinomial scheme it is the Minkowski sum of a polyhedral 
cone and as many polytopes as multinomial constraints. Even though it has 
smaller ambient dimension than the marginal cone, Cy is a geometric object 
that can be rather difficult to handle, both computationally and theoreti- 
cally. In contrast, we show that, for any sampling scheme satisfying condi- 
tions (A0)-(A2), it is in fact sufficient to deal with the polyhedral cone Ca, 
which is simpler to describe and analyze, both algorithmically and in theory; 
see the supplementary material Fienberg and Rinaldo (2012). In Rinaldo, 
Petrovic and Fienberg (2011) we provide various examples of how Theorem 3 
can be used to simplify the task of characterizing existence of the MLE for 
otherwise complicated models for networks and random graphs. These par- 
ticular models are based on product multinomial sampling constraints, in 
which case Theorem 3 yields what is known in polyhedral geometry as the 
Cayley trick. 

3.2. Parameter estimability. We now turn to the issue of estimability of 
the natural and log-linear parameters when the MLE does not exist. In our 
analysis, we rely on the key notion of facial sets, originally introduced in 
a slightly different form by Geiger, Meek and Sturmfels (2006). 

Definition 4. For a log-linear subspace Ai, a set J- C I is a facial set 
of Ai, when, for some // € Ai, 

= if i G T, 
H(i) < lii^F. 

Equivalently, J 7 is a facial set of Ai when, for any design matrix A for Ai 
(not necessarily of full column rank) , there exists some c£l (i such that 

(aj,c) = if % G T , 

(10) 

(aj,c) < Mi^T, 

where a^ denotes the ith row of A. Facial sets encode combinatorial and 
geometric properties of the log-linear subspace Ai which turn out to be 
crucial to our analysis. We summarize these properties in the next lemma. 

Lemma 5. Let A be a design matrix of Ai. The lattice of facial sets of AA 
is isomorphic to the face lattice of the marginal cone C\ . In particular, T is 
a facial set of Ai if and only if {aj , i £ J 7 } span the face of Ca isomorphic 
to T. 

Using this result, we can paraphrase Theorem 3 as follows [compare with 
Theorem 3.2 in Haberman (1974)]: 
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Corollary 6. The MLE exists if and only i/supp(n) is not contained 
in any facial set of Ai . 

We describe algorithms for determining facial sets and for using the pre- 
vious corollary in Fienberg and Rinaldo (2012). 

3.2.1. Estimability of the natural parameters. In this section, we rely 
on arguments proposed in Rinaldo, Fienberg and Zhou (2009) to study the 
estimability of the natural parameters. Let Cy denote the convex support of 
the family arising from a conditional Poisson scheme specified by a constraint 
matrix V; see Appendix A. Suppose that the observed sufficient statistics t = 
A T n belong to the relative interior of face Fy of Cy of dimension dp. Thus, 
the MLE of the natural parameters for the original family, supported on 
S*(V), is nonexistent, but the MLE of the natural parameter of the extended 
family supported Fy is well defined. Theorem 7 below generalizes Lemma 1 
by showing that, when the MLE does not exist, the linear combinations of 
the natural parameters that are estimable are determined, not only by the 
deterministic linear subspace arising from the sampling constraints, but also 
by the random linear subspace spanned by the normal cone to the face F 
of the marginal cone Ca containing A T n in its relative interior. As for the 
log-linear parameter, nonexistence of the MLE entails that only points in 
7Tf(.M AT) are estimable, where T is the random facial set corresponding 
to F. 

In preparation for the result, we need to set up some additional notation. 
By Lemma 15 in Appendix B, there exists one face F of Ca of dimension 
m + dF that contains Fy, with facial set T . Let Np be the normal cone to F 
and Cf CM. d be the linear subspace spanned by Np, so that dim(Cp) = 
d — m — dp (recall that, without loss of generality, we assume Ca to be 
full-dimensional). We further define the linear subspace 

M F :={A(3,(3eZ + C F }, 
where Z is given in (7). Just like in Lemma 1, we define the following equiva- 
lence relation on W d : 9\ 62 if and only if 6\ — 82 £ Z + Lp, and write 0a/> 
for the equivalence class containing 6. Finally, Q_\f F := {0j^ F ,6 G M. d }. 

Theorem 7. Consider the exponential family (5), with A of full rank d, 
and suppose that conditions (A0)-(A2) hold. Let Fy be a face of the convex 
support and J- the corresponding facial set of the normal cone. 

(i) For any G R d , the set 0j^ F is an affine subspace of W d of dimension 
m + dim(£^) = d — dp. The set Qj\f F is a dp -dimensional dimensional 
vector space isomorphic to irp(A4 Q M) and is comprised of parallel 
(d — dp) -dimensional affine subspaces ofM. d . 
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(ii) The extended family corresponding to Fy is non-identifiable: any two 

points 6\ rS 2 specify the same distribution. In fact, the family is 
parametrized by Qj\f F , or, equivalently, by Q N). Therefore, it 

is of order dp- 

The main point of Theorem 7 is that only natural parameters in Ojv> 
[or the log-linear parameters in ttj^{M OA/")] are estimable, with both sets 
being now random. In principle, nonidentifiability of the natural parame- 
ters, due to a nonexistent MLE, can be resolved using the same procedure 
of reduction to minimality described in the remarks following Lemma 1: 
identify a set of linearly independent vectors in IR 1 spanning M nJ\fp~, and 
use them to build a new design matrix of dimension I x dp. However, unlike 
the reduction to minimality carried out to remove the effect of the sampling 
constraints, which is design-dependent but not data-dependent, this reduc- 
tion depends on the random subspace Mf (the randomness arising from the 
exposed face F). Furthermore, while the sampling constraint reduction is 
easy to implement, since the matrix V is known, this second reduction re- 
quires us to compute a basis for £p, the linear space spanned by the normal 
cone to F. For the mean value parameter, the problem is to compute the 
facial set associated to the face F based solely on the observed sufficient 
statistics t, which amounts to identifying the face of Ca containing t in its 
relative interior. In general, both of these tasks are highly nontrivial, due to 
the combinatorial complexity of the face lattice of Ca; see the examples in 
Section 4. In the supplementary material [Fienberg and Rinaldo (2012)], we 
describe algorithms for accomplishing these tasks. 

As a corollary to Theorem 7, we can obtain each family in the extended 
family via a conditional Poisson sampling scheme that forces the base mea- 
sure to be supported on Fy , or equivalently, by requiring that the cells in F c 
have zero probability of containing positive counts. In this case, it is clear 
that assumption (Al) is violated. As a result, we can view each such family 
as a log-linear model under Poisson sampling scheme containing structural 
zeros along the (random) coordinates J- c . This is in fact consistent with the 
interpretation by Barndorff-Nielsen (1978), page 156, of the extended MLE 
as a conditional MLE, given that sufficient statistics lie on the boundary of 
the convex support. We formalize this observation in the next result. 

Corollary 8. Each face F of Cy of dimension < dp < d — m can 
be obtained as the convex support corresponding to the conditional Poisson 
scheme with constraint subspace Nf, where drm(.A/j?) = d — dp. 

Using the same arguments as in the remarks following Lemma 1, we 
also see that the Fisher information matrix at the extended MLE has rank 
dp < d, and therefore, is rank-deficient. This remains the case, even after 
accounting for the sampling constraints. Statistically, the singularity of the 
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observed Fisher information implies that the standard errors are not defined. 
From an algorithmic standpoint, this observation implies that the Newton- 
Raphson method for computing the MLE is bound to run into numerical 
instabilities, due to the fact that the Hessian matrix of the log-likelihood 
function is singular at any optimum [an issue illustrated empirically in Fien- 
berg and Rinaldo (2007)]. Furthermore, Corollary 2.8 in Rinaldo, Fienberg 
and Zhou (2009) shows that, under a nonexistent MLE, every point in the 
normal cone Np to the face F containing the observed sufficient statistics 
is a (random) direction of recession of the negative log-likelihood function, 
so that there are infinitely many directions of maximal increase of the log- 
likelihood function. 

3.2.2. Estimability of the mean value parameters under Poisson and prod- 
uct multinomial schemes. We now specialize our analysis to the case of 
Poisson and product multinomial sampling schemes. Besides their popular- 
ity, the main reason for focusing on these two particular sampling schemes is 
that the estimates of the cell mean values are highly interpretable. Under the 
Poisson scheme, the cell mean values are just the expected cell counts, while 
under the product multinomial scheme they are the conditional expectations 
of the cell counts given the grand total (in the multinomial case) or given the 
total counts in the portions of the table associated with the partitions used 
to define the product multinomial constraints. For other conditional Poisson 
sampling schemes, not only are the conditional cell mean values difficult to 
compute due to the unknown normalizing constant, but they are also less 
interpretable. 

Following Lauritzen (1996), we consider M = cl({exp(jit), /j, & M.}), the 
closure of the set of all cell mean values for a log- linear subspace M. . Thus, 
m G M if and only if m = lim n e^™ , for some sequence {fj, n } n C M . Lauritzen 
(1996) calls the set M the extended log-affine model. 

Theorem 9. Let t be the observed sufficient statistics, and let F be fa- 
cial set corresponding to the face of Ca containing t in its relative interior. 
The MLE of the cell mean vector exists, is unique and identical under Pois- 
son and product multinomial if and only if F = X. If F Cl, there exists 
one point m c in M such that m e = lim n exp(/x n ), where {^ n } n C M is any 
optimizing sequence such that 



Furthermore, supp(m e ) = F and II_A^n = Ll^m . 

This result shows that, for any observed table n, the log-likelihood func- 
tions in both sampling schemes admits always a unique maximizer, m c . 
Though supported only on the facial set associated with t, this vector ex- 
hibits exactly the same features as the "ordinary" MLE: it is the unique 




P-eM 



MAXIMUM LIKELIHOOD ESTIMATION IN LOG-LINEAR MODELS 



15 



point fh c G M such that A T m e = A T n and provided that J\f C Ai, maxi- 
mizes both the Poisson and product multinomial likelihoods. The substantial 
difference is that m e has positive coordinates only along the cells in the fa- 
cial set J- . Theorem 9 generalizes Theorem 4.8 in Lauritzen (1996). The 
improvement consists of identifying exactly the supports of the limit points 
in M, which are precisely the facial sets of Ca. 

Definition 10. The vector m c is the extended MLE of m and the zeros 
appearing in along the coordinates in T c = X\T are called the likelihood 
zeros. 

The term likelihood zeros highlight the fact that those zero counts, though 
arising as sampling and not as structural zeros, have a significant impact on 
the likelihood function and its optimizers. 

3.3. The geometry of the extended Poisson family. The results of The- 
orem 9 suggest that, for the Poisson and product multinomial schemes, we 
could, in fact, take the set M to be the cell mean value parameter space for 
the extended exponential family of distributions for the actual contingency 
table, not its sufficient statistics. We formalize this idea by relying on geo- 
metric considerations. For ease of readability, and without loss of generality, 
we focus on the Poisson sampling scheme, and only sketch how our results 
apply also to product multinomial cases. 

For a vector u G M. x , let 

u + = {max{u(i), 0}, i G 1} and u~ = {min{u(i), 0}, i G I}, 

so that u = u + — u~ and supp(u + ) n supp(u~) = 0. Furthermore, for any 
pair of nonnegative vectors x and u in K , write 

x u = JJx(i) u « 

i 

for the associated monomial. Following Geiger, Meek and Sturmfels (2006), 
page 1469 and Lemma A.l, we consider the toric variety X_m corresponding 
to the log-linear model Ai. 

Definition 11. The nonnegative toric variety Xj^t associated to the 
log-linear subspace Ai is the set of all vectors x G ]R> such that 

(11) x u+ =x u ~ VueM 1 . 

Geometrically, Xj^i is the intersection of the solution set of a system of 
polynomial equations with the nonnegative orthant. It is easy to see that 
any m > such that log(m) G M satisfies (11). Equation (11) can still hold, 
however, when some of the coordinates of m are zero. Finally, for any £ G Ca, 
consider the polyhedron 

(12) P£ = {xGMf :Ax = £}. 
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For a given sufficient statistic t = An, the set of lattice points in Pt, known 
as the fiber of t, consists of all possible tables having the same sufficient 
statistics as the observed table n. 

Theorem 12. (i) M = X M . 

(ii) For any nonzero m £ X_m, supp(m) is a facial set of Ca- 

(iii) The linear map A : M. x — > W d , given by mi— > Am, defines a homeomor- 
phism between Xj^ and Ca- 

(iv) For any observable sufficient statistic t = An, {m c } = X_m n -Pt and 
m c G ri(P t ). 

Part (i) of Theorem 12 is due to Geiger, Meek and Sturmfels (2006), while 
a slightly less general version of part (iii) is a standard result in the algebraic 
statistics literature; see, for example, Pachter and Sturmfels (2005), Drton, 
Sturmfels and Sullivant (2009). 

Overall, Theorem 12 shows that the set M is homeomorphic to the margin- 
al cone Ca and, therefore, as anticipated, we can use it as a legitimate mean 
value space for the extended family of the cell counts. The advantage of M 
over Ca is its direct interpretability in terms of cell mean values. This result 
extends directly to the multinomial sampling scheme. In this case, A specifies 
a homeomorphism between {x £ Xj^ : X^ x « = 1} an d -Pa = conv(A), which 
is known in algebraic geometry as the moment map; see Fulton (1993), Ewald 
(1996). In fact, under multinomial scheme, the extended mean-value space 
can be taken to be the intersection of X_m with the probability simplex 
in R 2 '. Furthermore, since M contains the constant functions, Pa an d Ca 
have identical facial sets. For product multinomial sampling schemes, a char- 
acterization of the mean value space analogous to the one given in Theo- 
rem 12 is also possible, though somewhat more involved. We refer the reader 
to Morton (2008) for details and a different derivation. In this particular 
case, the convex support arises as a Minkwoski sum of polytopes, one for 
every multinomial. Then, the proof of Theorem 3 reveals that facial sets of 
the convex support are also facial sets of the marginal cone, even though 
the opposite is not true. See Rinaldo, Petrovic and Fienberg (2011) for an 
application of these results to network models. 

Finally, part (iv) of Theorem 12 shows that the extended MLE is the only 
point in Pt satisfying the log-linear model conditions. This result can be also 
interpreted in terms of I-divergence projections [Csiszar (1975, 1989)], and 
provides the geometric basis for showing convergence of iterative methods for 
extended maximum likelihood estimation such as the iterative proportional 
scaling algorithm of Darroch and Ratcliff (1972). In the interest of space, 
we do not pursue this analysis. 

4. Inference under a nonexistent MLE. We have shown that when the 
MLE does not exist, only some of the model parameters (both under the nat- 
ural and mean-value parametrization) are estimable, and we have identified 
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the parameters that can instead be estimated within the extended family. 
Thus, when the MLE is nonexistent, statistical inference is still feasible, but 
only for the reduced family whose parameters are fully estimable. 

As described at the end of Section 3.2.1, we can obtain the relevant ex- 
tended exponential family by computing a new random design matrix Ajr 
whose column span is irjr(A4 QAf), where F is the random facial set corre- 
sponding to the face F of the marginal cone containing the sufficient statis- 
tics in its relative interior. We can then use this new design matrix to specify 
a new exponential family as in (5), where only the cells F have positive prob- 
ability of being observed. We carry out inference within this extended family 
or, equivalently, conditionally on the sufficient statistics being on the face F, 
as advocated by Barndorff-Nielsen (1978), page 156. By Corollary 8, this is 
equivalent to treating the coordinates in F as if they were structural zeros. 
Thus, dealing with a nonexistent MLE reduces, in practice, to fitting the 
same log-linear model under the additional (random) constraints that the 
cells in F c , which are not estimable, be treated as structural zeros. The 
same approach is also advocated in Geyer (2009). In practice, this entails 
replacing the MLE with the extended MLE and, quite importantly, adjust- 
ing the number of degrees of freedom, now to be computed as the difference 
between the cardinality of the facial set \F\ (i.e., the number of cell mean 
values that can be estimated), and the number of estimable parameters, 
namely dim(-7rj-(A / l © AO) — dim(F) — m. Using the adjusted number of de- 
grees of freedom, asymptotic \ 2 tests for goodness of fit [see, e.g., Read and 
Cressie (1988)] can then still be applied. Algorithms for carrying out the 
numerical tasks just described are presented in the supplementary material 
[Fienberg and Rinaldo (2012)]. 

5. Examples of likelihood zeros. Below, we illustrate by means of ex- 
amples various practical aspects of goodness-of-fit testing when the MLE is 
nonexistent, and we show how to appropriately adjust the number of degrees 
of freedom. We will focus on hierarchical log-linear models [see, e.g., Bishop, 
Fienberg and Holland (1975)], and refer the reader to Dobra et al. (2009) 
and Rinaldo, Petrovic and Fienberg (2011) for other examples of this kind. 

Our polyhedral characterization of the conditions for the existence of the 
MLE permits to generate novel examples of patterns of sampling zeros caus- 
ing nonexistence of the MLE for hierarchical log-linear models without pro- 
ducing null margins, an instance that is virtually ignored in all statistical 
software. As pointed out by Fienberg and Rinaldo (2007), the R [R Devel- 
opment Core Team (2005)] routines loglin and glm, as well as virtually 
any other software for inference and model selection for log-linear models, 
does no detect nonexistence and report the unadjusted, incorrect, numbers 
of degrees of freedom for all the examples below. In the analysis of sparse 
tables, it is also common practice to add small positive quantities to the zero 
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cells, in order to avoid numerical issues with the computation of the MLE. 
We remain highly skeptical of the numerical advantages of this ad-hoc pro- 
cedure, and remark that such adjustments will make it impossible to detect 
nonexistence of the MLE and to distinguish the estimable parameters. 

The examples of likelihood zeros in Examples 2-4 suggest that the combi- 
natorial complexity of hierarchical log-linear models, measured by the num- 
ber of facets of the marginal cone, can be quite significant. In the reported 
examples, as well as in many other experiments we conducted, for many mod- 
els the number of facets associated with zero margins appears to be much 
smaller than the total number of facets, indicating that, at least combinato- 
rially, likelihood zeros associated to positive margins are much more frequent 
(though never detected). Below we use the classic notation to represent the 
generating class of a hierarchical log-linear model; for example, see Bishop, 
Fienberg and Holland (1975). Empty cells indicate positive counts. All the 
calculations were carried out in polymake [Gawrilow and Joswig (2000)]. 

Example 1. The 2 3 table and the model [12] [13] [23] of no-second-order 
interaction. The MLE is not defined because the two likelihood zeros expose 
one of the 16 facets of the marginal cone. This example, due to Haberman 
(1974), was the only published example a log-linear model with nonexistent 
MLE and positive margins; see Fienberg and Rinaldo (2007), Section 5, 
for a general result concerning binary K-w&y tables and the model of no- 
(K — l)st interaction. 























The dimension of the log-linear subspace for this model, or, equivalently, of 
the marginal cone, is 7, leaving 1 degree of freedom when the MLE exists. 
However, because of the likelihood zeros, inference can only be made for the 
6-dimensional exposed facet. Since the cardinality of the associated facial 
set T is also 6, the resulting extended log-linear model is the saturated 
model on J-. 



Example 2. The 3 3 table and the model [12] [13] [23]. The MLE is not 
defined because the pattern of likelihood zeros exposes one of the 207 facets 
of the marginal cone. Of all the facets, only 27 are associated to zero margins. 





































































The dimension of the facet is 18, which is also the cardinality of the facial 
set for this configuration of likelihood zeros. As in the previous example, 
this defines the saturated model on J 7 , giving adjusted degrees of freedom 
and making x 2 approximations not applicable. 
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Under the same log-linear model, the MLE does not exist also when the 
following pattern of zeros arises: 




































































In this example, the zeros displayed in bold are not likelihood zeros, but the 
others are. Indeed, their presence or absence has no effect on the existence of 
the MLE. Furthermore, when the extended MLE is computed, the boldfaced 
zero counts will be replaced by positive entries, while the likelihood zeros will 
stay zero. The number of degrees of freedom in this example is 3, because 
the total number of estimable cell mean values is 21, and the number of 
parameters for the reduced model is 18. 

In our last example, the MLE is defined, despite the table being very 
sparse, because no facet of the marginal cone is exposed [source: Fienberg 
and Rinaldo (2007)]. 














































































Example 3. The 4x4x4 table and the model [12] [13] [23]. The MLE 
is not defined because the pattern of zeros exposes one of 113,740 facets 
of the marginal cone [source: Eriksson et al. (2006)]. Of these, only 48 are 
associated to zero margins. 































































































































































Example 4. The 3 4 table and the 4-cycle model [12] [14] [23] [34]. The 
MLE is not defined because the pattern of zeros exposes one of the 1116 
facets of the marginal cone. Of these, only 36 are associated to zero margins. 
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6. Algorithms for extended maximum likelihood estimation. In the sup- 
plementary material [Fienberg and Rinaldo (2012)], we apply the theory 
developed in this article to develop efficient algorithms for extended max- 
imum likelihood estimation in log-linear models under Poisson and prod- 
uct multinomial schemes [for which the key integrality assumption (A2) is 
satisfied] that are applicable to high-dimensional models and large tables. 
Some of these algorithms are implemented in a MATLAB toolbox available at 
http://www.stat.cmu.edu/~arinaldo/ExtMLE/. The final output of our 
procedure is the set of estimable mean value and natural parameters. 

APPENDIX A: EXTENDED EXPONENTIAL FAMILIES 

In this appendix we provide a brief review of the theory of extend families 
and its relevance for log-linear models. Along with classic references on ex- 
ponential families [Barndorff-Nielsen (1978), Brown (1986), Cencov (1982), 
Letac (1992)] and generalizations by Csiszar and Matus (2001, 2003, 2005, 
2008), we refer the reader to Rinaldo, Fienberg and Zhou (2009) and Geyer 
(2009) for treatments more directly relevant to our problem. 

Consider a log-linear model under conditional Poisson sampling scheme 
specified by a sampling matrix V of rank m and a design matrix A of the 
form (8), where B is of full-rank d — m. Then [see equation (6)], the dis- 
tribution of the sufficient statistic z = B T n form an exponential family of 
distributions So v on M. d ~ m with densities 

? *(z)=exp((z,0)-^(0)), 0eO, 

with respect to the base measure fly = ^v" 1 ^' an< ^ parameter space = 
M d_m . The convex support Cy of £c v is the closure of the convex hull of the 
support of fjty. In particular, P is a full-dimensional polyhedron in M d_m and, 
for every face F of Cy, F is the convex hull of some points in the support 
of fly. Given a realization z of the sufficient statistics, the random set 

(13) 0(z) =0= id* G : qg* (z) = sup q e (z)\ 

is the maximum likelihood estimator (MLE) of 0. If 9 = we say that the 
MLE does not exist. Existence of the MLE is fully characterized by the ge- 
ometry of Cy, as the following well-known result indicates; see, for example, 
Theorem 5.5 in Brown (1986) or Theorem 9.13 in Barndorff-Nielsen (1978). 

Theorem 13. For a minimal and full exponential family, the MLE 6 
exists and is unique if and only if z G ri(P). 

Setting £(0) = f R d- m zqg(z) dfiy(z), because of the minimality of £c* v > one 
obtains the fundamental identity Vip{6) = £ t {6),\/0 £ 0, where V indicates 
the gradient. In particular, if the MLE exists, it satisfies the equation 9 = 
(V^) _1 (z), which is equivalent to the moment equation $,(0) =z. 
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For any proper face F of Cy, let /iy be the restriction of fly to .F. Then, 
^iy determines a new exponential family of distributions, £p, with densities 
with respect to /iy given by 

^(x)=exp((z,0)-^ F (0)), 9ee f , 

where the natural parameter space is ©i? = {# £ : exp(ifi F (0)) < 00} = 0, 
with ip F (0) = log / K d-m exp((z, 0)) d//y (z). The convex support of this new 
family is F, and the existence result of Theorem 13 carries over: the MLE 
exists if and only if the observed sample z belongs to 11(F). However, since £p 
is supported on a lower-dimensional affine subspace of M. d ~ m of dimension 
dp = dim(i ? ), it is no longer minimal, hence it is unidentifiable. Nonetheless, 
if z £ ri(F), the MLE of is the set consisting of those satisfying the first 
order optimality conditions 

(14) z = Vip F (6) or, equivalently, £ F (0)=z, 

where £ F (0) = J Rd _ m zq F (z) d/i^(z). 
The collection of distributions 

£ = \J£f 

F 

as F ranges over all the faces of Cy, including Cy itself, is known as the 
extended exponential family of distributions. With respect to such family £, 
for any observed sample z, the MLE, or extended MLE, is always well defined 
and is the set of solutions to (14), where F is the unique face containing z 
in its relative interior. 

APPENDIX B: PROOFS 

This appendix contains the proofs of some results stated in the article. 
The remaining proofs can be found in the supplementary material Fienberg 
and Rinaldo (2012). Throughout, we assume familiarity with basic notions 
of polyhedral geometry; see Ziegler (1995), Schrijver (1998) and Rockafellar 
(1970) for in-depth treatments, and Section 2.1 of Rinaldo, Fienberg and 
Zhou (2009) for a brief review of the concepts directly relevant to our setting. 

Proof of Theorem 3. We first assume that A is of full rank d. If N = 
{0}, then the convex support is the d-dimensional polyhedral cone C\, so the 
result follows directly from Theorem 13. Thus, throughout the remainder of 
the proof we consider the case < dim(AA) < d. For now, we further assume 
that A is of the form (8). 

By standard minimality arguments, we can work with the exponential 
family supported on S = {z : z = B T x, x E N x , V T x = 1}. By assumption (Al), 
the convex support Cy, which is the closure of the set conv(5), is a full- 
dimensional polyhedron in M. d ~ m . In particular, the parameter space is M. d ~ m . 
The MLE exists and is unique if and only if z S ri(CV) by Theorem 13. We 
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now show that this happens if and only if t G ri(CA). We first use the inte- 
grality assumption (A2) to obtain a simpler representation of Cy. 

Lemma 14. 

Cy = {B T x : x G M| , V T x = 1}. 

Proof. Since is a polyhedron (hence closed 

and convex), it must contain Cy. To show the reverse inclusion, let z* G 
{B T x:x 6 M^ Q ,\ T x = 1 1- Then ' z * = BTx * for some x* G {x:x G R^q, 
V T x = 1}. By the integrality assumption (A2), 

x* Gconv({x:xGN Z ,V T x=l}), 

which by linearity implies that z* G conv(B T x : x G N 1 , V T x = 1}) C Cy, as 
claimed. □ 

For design matrices of the form (8), the claim in the theorem follows 
directly from the next lemma. 

Lemma 15. There exists a homomorphism from the face lattice of Cy 
to the face lattice of Ca that associates to each face of Cy of dimension oIf 
the (unique) face of Ca of dimension m + dF containing it. 

Proof. Instead of concerning ourselves with Cy, we find it convenient 
to deal with the d — m-dimensional polyhedron in M. d 

(15) Ty = C A n {t = (h, . . . , t d ) T eM d :t j = l,j = d-m+l,...,d}. 

In light of the next result, Ty and Cy have the same combinatorial proper- 
ties. 

Lemma 16. The polyhedra Ty and Cy are combinatorially equivalent. 

Proof. By Lemma 14, z G Cy if and only if ( ? ) G Ty. Thus the coordi- 
nate projection map ir : M. d — > M. d ~ m given by ir(xi, . . . , Xd) = (x\, . . . , Xd- m ) 
defines a bijection between Cy and Ty. Since tt is a linear mapping, Cy 
and Ty are affinely equivalent, hence combinatorially equivalent. □ 

It follows from Lemma 16 that there exists a bijection between Cy and Ty 
that is also a bijection between boundary points of Cy and points on the 
relative boundary of Ty in such a way that the face lattices of Cy and Ty 
are identical. Note also that isomorphic faces of the polyhedra have the same 
dimension. Therefore, it is sufficient to prove that the claim of the theorem 
holds for Ty instead of Cy. 

Using the ^-representation [see, e.g., Ziegler (1995), Schrijver (1998)] we 
write 

(16) C A = {t G R d : Ct < 0} 



MAXIMUM LIKELIHOOD ESTIMATION IN LOG-LINEAR MODELS 



23 



for some matrix C, where we can assume that no inequality is redundant. 
In particular, any face F of Ca of co-dimension k can be written as 

{t:Ct<O,( Ci ,t) = 0,i = l,. ..,*}, 

where (ci, . . . , c^) are the k rows of C that define the k supporting hyper- 
planes whose intersection with Ca is precisely F. Define 

T=[0 I m ], 

where is the m X (d — m) matrix of zeros, and \ m is the m x m identity 
matrix. Thus, Ty is the set of points in M. d given by {t : Dt < b}, with 



D 



c 

T 

-T 



and b 



where C is the sub-matrix of C obtained by removing the rows correspond- 
ing to inequalities that may have become redundant once the sampling con- 
straint are enforced. These inequalities are the precisely the defining inequal- 
ities for the facets that do not intersect the affine space {t : Tt = 1}. Notice 
that, by (Al), the dimension of Ty is equal to d minus the rank of 

T 

-T 

which is m. Next, any face F of Ty of co-dimension k can be written as 

F={t:Dt<b, (d i ,t)=0,j = l,...,fc}, 

where (di, . . . , d^) are the k rows of C that define the k supporting hyper- 
planes of F. Since the points in F satisfy all the inequalities (16), it fol- 
lows that F is contained in the set F' = {t : Ct < 0, (d,-,t) = 0, j = 1, . . . , k}, 
which is a face of Ca of co-dimension k. It is also immediate to see that F' 
is the smallest such face. Furthermore, if C is a different face of Ty of 
co-dimension k, it is defined by a different set of equalities, so it is con- 
tained in a different face of Ca (of co-dimension k). If G is instead of co- 
dimension k' > k and is also a face of F, then, G = {t :Dt < b, (dj,t) = 0, 
j = 1, . . . ,k, . . . , k'}, so that G is contained in the set {t : Ct < 0, (dj,t) = 0, 
j = 1, . . . , k'}, which is a face of Ca of co-dimension k! and also a face of F' . 

Therefore, the mapping that associates to each face of Ty the smallest 
face of Ca containing it (and of the same co-dimension) is a lattice homo- 
morphism from the face lattice of Ty to the face lattice of Ca- Furthermore, 
since the homomorphism just described is between faces of the same co- 
dimension, and dim(Tv) = d — m while dim(G0 = d, each face of Ty of 
dimension dp is mapped to a face of Ca of dimension m + dp. □ 
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Thus far we have assumed that the design matrix A is of full rank and has 
the form specified by equation (8). Now let A' be any design matrix with row 
span A4, not necessarily of the form (8), or not even of full rank. Then, Cpj 
is also a polyhedral cone of dimension d, though its ambient dimension may 
be larger. As A' and A have the same null space, the cones Cpj and Ca are 
afhnely isomorphic, hence combinatorially equivalent. Thus, t' = (A') T x G 
ri(CA') if and only if t = A T x, which shows that the theorem holds for any 
generic design matrix A. □ 

Proof of Theorem 9. We show that, under both Poisson and prod- 
uct multinomial scheme, the MLE exists, is unique and is identical in both 
cases if and only if t = An is a point in the relative interior of Ca • If t be- 
longs to the relative interior of a face F, then both log- likelihood functions 
realize their suprema along sequences of points [i n C Ai for which the limit 
exp(/x n ) = m c is unique, satisfies the moment equations n^n = n^n and 
supp(m) = T . 

First, we consider the problem of maximizing the log-likelihood £ p (fJ<) = 
(n, jLt) — Siex ex P(/ x (*)) under Poisson sampling scheme. Suppose t = A T n 
lies inside the relative interior of a proper face F of Ca with corresponding 
facial set T . Then, there exists a zp G kernel(A) = M. 1 - such that the vector 
x F = n + zp satisfies t = A T x F and supp(n + zp) = T . Furthermore, since, 
for any (j,eM, (z f ,/i) = 0, £ p (fj,) = (x F ,/x) - £ i6X exp(/i(i)). 

Define £ p and £ p c to be the restriction of i p on irp(Ai) and 7rj-c(7W), 
respectively. Explicitly, 

^O) = (xjj 7Tjr(/x)) - exp(/x(i)) = (x F , fx) - Y exp(/x(i)) 

and i^ c {n) = -E Je ^exp(/x(i)). Therefore, £ p (fi) = £ p {[i) + l^c(p). On 
ttf(M), the function £ p is bounded from above, continuous and strictly 
concave, so it is maximized by the unique point fi*-p G ttj^(M) that satisfy 
the first order optimality conditions on the differential of £ p [see Haberman 
(1974), Chapter 2] given by 

(17) (Ajr,exp(Ai3r)) = (A^70r(x F )) = (Ajr,n) VA^GttHK), 

where the second equality holds since x F G A4 ± and supp(AV) = T . 

On the other hand, on ttj^c(M), the function £ p c is negative and strictly 
decreasing in each coordinate of its argument. Thus, 

SU P 4v) < SU P = ^(Mjf)- 

We now show that the above inequality is in fact an equality by finding 
a sequence {/x n } C M. such that 

lim£ p (/i n ) = ^(/,». 
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To this end, let n* be any vector in M such that irjr([i*) = fi^. Next, since J- 
is a facial set, there exists a sequence {f n } C Ai such that: 

(i) if i € F, then 7 n (i) = 0, for all n; 

(ii) if i £ J 7 ' 1 , then 7 n (i) < for all n and lim n 7 n (z) = — oo (the rate at 
which these series diverge to infinity being arbitrarily fast). 



/J,*(i) if i G F, 
— oo if i £ F c , 



Define the sequence {/x n } C M. as [J, n = fi* + 7 n . Then, 
lim/x n (i) = 

n 

from which it follows that 

lim£ p (n n ) = lim^(7r^(/i n )) +lim^ c (7T^c(^ n )) 

n n n 

n 

as desired, since 

\im£jr c (TT T c(^ )) = V limexp(/x n (i)) = 0. 

n ^ — » n 

ie.F e 

Set m c = lim n exp(/x n ), and notice that m c is the unique vector in M. x such 
that 

7rj-(m c ) = exp(^3r), 
vr^(m c )=0, 

where uniqueness stems from the uniqueness of /ij- (it is clear that, while m c 
is unique, the sequence {/x n } is not). Furthermore, m c is random, as it 
depends on the facial set F associated to the face of Ca exposed by t = A T n. 
Finally, in virtue of the fact that supp(n) C F, we see that, for any A G M, 

(A, m e ) = (Aj-, exp(fi* T )) and (Ajr, n) = (A, n) 

so that, using (17), m e can be characterized as the unique point in M such 
that 

(A,m c ) = (A,n) VAG.M, 

or, equivalently, 

(18) A T m c = A T n or n^m^n^n. 

If we instead want to maximize the log-likelihood function £ M under prod- 
uct multinomial sampling, we need to consider only the points fi inside M 
as in equation (4). Fortunately, this restriction is inconsequential. First note 
that, by (18) and because M C A4, the limit fi* satisfies the constraints 
{(Xj, exp(/x*)) = Nj,j = l,...,r}. Next, since £ and £ p differ by a con- 
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stant on M and A4 C Ai , we have that 

e M (n*) = SU pj M (/i). 

p.eM 

We conclude that the log-likelihood functions under both the Poisson and 
product multinomial model must have the same maximizer fh. 

Finally, we note that if t 6 ri(CA), so that J 7 = 1, the arguments simplify. 
Explicitly, there exists a point /i* E M. C M. such that 

sup ^( M )=^V), 

supJ M (/i) =l M {v*) 
fi&M 

which we can obtain as the unique point m G £ M with supp(m c ) = Z satis- 
fying(18). □ 
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